University of Pennsylvania, University of Pennsylvania, University 
of Pennsylvania and Purdue University 

Variance function estimation in nonparametric regression is con- 
sidered and the minimax rate of convergence is derived. We are par- 
ticularly interested in the effect of the unknown mean on the esti- 
mation of the variance function. Our results indicate that, contrary 
to the common practice, it is not desirable to base the estimator of 
the variance function on the residuals from an optimal estimator of 
the mean when the mean function is not smooth. Instead it is more 
desirable to use estimators of the mean with minimal bias. On the 
other hand, when the mean function is very smooth, our numerical 
results show that the residual-based method performs better, but not 
substantially better, than the first-order-difference-based estimator. In 
addition our asymptotic results also correct the optimal rate claimed 
in Hall and Carroll [J. Roy. Statist. Soc. Ser. B 51 (1989) 3-14]. 

1. Introduction. Consider the heteroscedastic nonparametric regression 
model 

(1) $y_i = f(x_i) + V^{1/2}(x_i) z_i, \quad i = 1, \ldots, n,$

where $x_i = i/n$ and the $z_i$ are independent with zero mean, unit variance and 
uniformly bounded fourth moments. Both the mean function f and variance 
function V are defined on [0, 1] and are unknown. The main object of 
interest is the variance function V. The estimation accuracy is measured both 
globally by the mean integrated squared error 

(2) $R(\hat V, V) = E \int_0^1 (\hat V(x) - V(x))^2 \, dx$

and locally by the mean squared error at a point 

(3) $R(\hat V(x_*), V(x_*)) = E(\hat V(x_*) - V(x_*))^2.$

We wish to study the effect of the unknown mean f on the estimation of 
the variance function V. In particular, we are interested in the case where 
the difficulty in estimation of V is driven by the degree of smoothness of the 
mean f. 

The effect of not knowing the mean f on the estimation of V has been 
studied before in Hall and Carroll (1989). The main conclusion of their 
paper is that it is possible to characterize explicitly how the smoothness 
of the unknown mean function influences the rate of convergence of the 
variance estimator. In association with this they claim an explicit minimax 
rate of convergence for the variance estimator under pointwise risk. For 
example, they state that the "classical" rate of convergence ($n^{-4/5}$) for the 
twice differentiable variance function estimator is achievable if and only if 
f is in the Lipschitz class of order at least 1/3. More precisely, Hall and 
Carroll (1989) stated that, under the pointwise mean squared error loss, 
the minimax rate of convergence for estimating V is 

(4) $\max\{n^{-4\alpha/(2\alpha+1)}, n^{-2\beta/(2\beta+1)}\}$

if f has $\alpha$ derivatives and V has $\beta$ derivatives. We shall show here that this 
result is in fact incorrect. 

In the present paper we revisit the problem in the same setting as in Hall 
and Carroll (1989). We show that the minimax rate of convergence under 
both the pointwise squared error and global integrated mean squared error 
is 

(5) $\max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\}$

if f has $\alpha$ derivatives and V has $\beta$ derivatives. The derivation of the minimax 
lower bound is involved and is based on a moment matching technique and 
a two-point testing argument. A key step is to study a hypothesis testing 
problem where the alternative hypothesis is a Gaussian location mixture 
with a special moment matching property. The minimax upper bound is 
obtained using kernel smoothing of the squared first order differences. 

Our results have two interesting implications. First, if V is known to 
belong to a regular parametric model, such as the set of positive polynomials 
of a given order, the cutoff for the smoothness of f on the estimation of V 
is 1/4, not 1/2 as stated in Hall and Carroll (1989). That is, if f has at 
least 1/4 derivative then the minimax rate of convergence for estimating V 
is solely determined by the smoothness of V as if f were known. On the 
other hand, if f has less than 1/4 derivative then the minimax rate depends 
on the relative smoothness of both f and V and will be completely driven 
by the roughness of f. 

Second, contrary to the common practice, our results indicate that it is 
often not desirable to base the estimator $\hat V$ of the variance function V on 
the residuals from an optimal estimator $\hat f$ of the mean function f when f 
is not smooth. Instead it is more desirable to use estimators of the mean 
f with minimal bias. The main reason is that the bias and variance of $\hat f$ 
have quite different effects on the estimation of V. The bias of $\hat f$ cannot 
be removed or even reduced in the second stage smoothing of the squared 
residuals, while the variance of $\hat f$ can be incorporated easily. On the other 
hand, when the mean function is very smooth, our numerical results show 
that the residual-based method performs better, but not substantially better, 
than the first-order-difference-based estimator. 

The paper is organized as follows. Section 2 presents an upper bound for 
the minimax risk while Section 3 derives a rate-sharp lower bound for the 
minimax risk under both the global and local losses. The lower and upper 
bounds together yield the minimax rate of convergence. Section 4 discusses 
the obtained results and their implications for practical variance estimation 
in the nonparametric regression. Section 5 considers finite sample perfor- 
mance of the difference-based method for estimating the variance function. 
The proofs are given in Section 6. 

2. Upper bound. In this section we shall construct a kernel estimator 
based on the square of the first order differences. Such and more general 
difference based kernel estimators of the variance function have been con- 
sidered, for example, in Müller and Stadtmüller (1987, 1993). For estimating 
a constant variance, difference based estimators have a long history. See von 
Neumann (1941, 1942), Rice (1984), Hall, Kay and Titterington (1990) and 
Munk, Bissantz, Wagner and Freitag (2005). 

Define the Lipschitz class $\Lambda^\alpha(M)$ in the usual way, 

$\Lambda^\alpha(M) = \{g : \text{for all } 0 \le x, y \le 1,\ k = 0, \ldots, \lfloor\alpha\rfloor - 1,$
$\qquad |g^{(k)}(x)| \le M \text{ and } |g^{(\lfloor\alpha\rfloor)}(x) - g^{(\lfloor\alpha\rfloor)}(y)| \le M|x - y|^{\alpha'}\},$

where $\lfloor\alpha\rfloor$ is the largest integer less than $\alpha$ and $\alpha' = \alpha - \lfloor\alpha\rfloor$. We shall 
assume that $f \in \Lambda^\alpha(M_f)$ and $V \in \Lambda^\beta(M_V)$. We say that the function f "has 
$\alpha$ derivatives" if $f \in \Lambda^\alpha(M_f)$ and V "has $\beta$ derivatives" if $V \in \Lambda^\beta(M_V)$. 
For $i = 1, 2, \ldots, n-1$, set $D_i = y_i - y_{i+1}$. Then one can write 

(6) $D_i = f(x_i) - f(x_{i+1}) + V^{1/2}(x_i) z_i - V^{1/2}(x_{i+1}) z_{i+1} = \delta_i + \sqrt{2}\, V_i^{1/2} \epsilon_i,$

where $\delta_i = f(x_i) - f(x_{i+1})$, $V_i^{1/2} = \sqrt{\tfrac{1}{2}(V(x_i) + V(x_{i+1}))}$ and 

$\epsilon_i = (V(x_i) + V(x_{i+1}))^{-1/2} (V^{1/2}(x_i) z_i - V^{1/2}(x_{i+1}) z_{i+1})$

has zero mean and unit variance. 

We construct an estimator $\hat V$ by applying kernel smoothing to the squared 
differences $D_i^2$ which have means $\delta_i^2 + 2V_i$. Let K(x) be a kernel function 
satisfying 

K(x) is supported on [-1, 1], $\int_{-1}^{1} K(x)\,dx = 1$, 

$\int_{-1}^{1} K(x) x^i\,dx = 0$ for $i = 1, 2, \ldots, \lfloor\beta\rfloor$ and 

$\int_{-1}^{1} K^2(x)\,dx = k < \infty.$



It is well known in kernel regression that special care is needed in order 
to avoid significant, sometimes dominant, boundary effects. We shall use 
the boundary kernels with asymmetric support, given in Gasser and Müller 
(1979, 1984), to control the boundary effects. For any $t \in [0, 1]$, there exists 
a boundary kernel function $K_t(x)$ with support [-1, t] satisfying the same 
conditions as K(x), that is, 

$\int_{-1}^{t} K_t(x)\,dx = 1,$

$\int_{-1}^{t} K_t(x) x^i\,dx = 0$ for $i = 1, 2, \ldots, \lfloor\beta\rfloor,$

$\int_{-1}^{t} K_t^2(x)\,dx \le \bar{k} < \infty$ for all $t \in [0, 1]$. 

We can also make $K_t(x) \to K(x)$ as $t \to 1$ (but this is not necessary here). 
See Gasser, Müller and Mammitzsch (1985). For any $0 < h < 1/2$, $x \in [0, 1]$, 
and $i = 2, \ldots, n-2$, let 

$K_i^h(x) = \begin{cases}
\int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} \frac{1}{h} K\left(\frac{x-u}{h}\right) du, & \text{when } x \in (h, 1-h), \\
\int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} \frac{1}{h} K_t\left(\frac{x-u}{h}\right) du, & \text{when } x = th \text{ for some } t \in [0, 1], \\
\int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} \frac{1}{h} K_t\left(-\frac{x-u}{h}\right) du, & \text{when } x = 1 - th \text{ for some } t \in [0, 1],
\end{cases}$

and we take the integral from 0 to $(x_1 + x_2)/2$ for i = 1, and from 
$(x_{n-1} + x_{n-2})/2$ to 1 for i = n-1. Then we can see that for any $0 \le x \le 1$, 
$\sum_{i=1}^{n-1} K_i^h(x) = 1$. Define the estimator $\hat V$ as 

(7) $\hat V(x) = \frac{1}{2} \sum_{i=1}^{n-1} K_i^h(x) D_i^2.$

Same as in the mean function estimation problem, the optimal bandwidth 
$h_n$ can be easily seen to be $h_n = O(n^{-1/(1+2\beta)})$ for $V \in \Lambda^\beta(M_V)$. For this 
optimal choice of the bandwidth, we have the following theorem. 

Theorem 1. Under the regression model (1) where $x_i = i/n$ and $z_i$ are 
independent with zero mean, unit variance and uniformly bounded fourth 
moments, let the estimator $\hat V$ be given as in (7) with the bandwidth 
$h = O(n^{-1/(1+2\beta)})$. Then there exists some constant $C_0 > 0$ depending only on 
$\alpha$, $\beta$, $M_f$ and $M_V$ such that for sufficiently large n, 

(8) $\sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} \ \sup_{0 \le x_* \le 1} E(\hat V(x_*) - V(x_*))^2 \le C_0 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\}$

and 

(9) $\sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} E \int_0^1 (\hat V(x) - V(x))^2\,dx \le C_0 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\}.$

Remark 1. The uniform rate of convergence given in (8) yields imme- 
diately the pointwise rate of convergence that for any fixed point $x_* \in [0, 1]$ 

$\sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} E(\hat V(x_*) - V(x_*))^2 \le C_0 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\}.$

Remark 2. It is also possible to use the local linear regression estimator 
instead of the Priestley-Chao kernel estimator. In this case, the boundary 
adjustment is not necessary as it is well known that the local linear regression 
adjusts automatically in boundary regions, preserving the asymptotic order 
of the bias intact. However, the proof is slightly more technically involved 
when using the local linear regression estimator. For details see, for example, 
Fan and Gijbels (1996). 
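
The estimator (7) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: it assumes the Epanechnikov kernel (which has a vanishing first moment, so it is suitable only when $\lfloor\beta\rfloor = 1$), evaluates the estimator at interior points $x \in (h, 1-h)$ only (no boundary kernels), and uses a fixed bandwidth $h = n^{-1/5}$. The function names and the simulated example are hypothetical.

```python
import numpy as np

def epanechnikov_cdf(v):
    """CDF of the Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1]."""
    v = np.clip(v, -1.0, 1.0)
    return 0.5 + 0.75 * (v - v**3 / 3.0)

def difference_weights(n, x, h):
    """Weights K_i^h(x), i = 1, ..., n-1, for an interior point x in (h, 1-h)."""
    grid = np.arange(1, n) / n        # x_1, ..., x_{n-1} on the design x_i = i/n
    lower = grid - 0.5 / n            # (x_{i-1} + x_i) / 2
    upper = grid + 0.5 / n            # (x_i + x_{i+1}) / 2
    lower[0], upper[-1] = 0.0, 1.0    # the end cells extend to 0 and to 1
    # Substituting v = (x - u) / h turns each cell integral into a CDF difference.
    return epanechnikov_cdf((x - lower) / h) - epanechnikov_cdf((x - upper) / h)

def difference_variance_estimate(y, x, h):
    """Estimator (7): half the kernel-weighted sum of squared first differences."""
    n = len(y)
    d_squared = (y[:-1] - y[1:]) ** 2  # D_i^2 for i = 1, ..., n-1
    return 0.5 * np.sum(difference_weights(n, x, h) * d_squared)

# Illustrative run (assumed settings): a rough mean and a smooth variance function.
rng = np.random.default_rng(0)
n = 4000
design = np.arange(1, n + 1) / n
true_variance = (design - 0.5) ** 2 + 0.5
y = 0.75 * np.sin(40 * np.pi * design) + np.sqrt(true_variance) * rng.standard_normal(n)
v_hat_mid = difference_variance_estimate(y, 0.5, h=n ** (-1 / 5))
```

Because the weights are differences of a kernel CDF over the cells around each design point, they sum to one at interior points, which mirrors the property $\sum_i K_i^h(x) = 1$ used in Section 2.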

Remark 3. It is important to note here that the results given in Theo- 
rem 1 can be easily generalized to the case of random design. In particular, 
if the observations $X_1, \ldots, X_n$ are i.i.d. with the design density f(x) that is 
bounded away from zero (i.e., $f(x) \ge \delta > 0$ for all $x \in [0, 1]$), then the results 
of Theorem 1 are still valid conditionally. In other words, 

$\sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} \ \sup_{0 \le x_* \le 1} E((\hat V(x_*) - V(x_*))^2 \mid X_1, \ldots, X_n)$
$\qquad \le C_0 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\} + o_p(\max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\})$

and 

$\sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} E\left(\int_0^1 (\hat V(x) - V(x))^2\,dx \,\Big|\, X_1, \ldots, X_n\right)$
$\qquad \le C_0 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\} + o_p(\max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\})$

where the constant $C_0 > 0$ now also depends on $\delta$. 

3. Lower bound. In this section we derive a lower bound for the minimax 
risk of estimating the variance function V under the regression model (1). 
The lower bound shows that the upper bound given in the previous section is 
rate-sharp. As in Hall and Carroll (1989) we shall assume in the lower bound 
argument that the errors are normally distributed, that is, $z_i \overset{\text{i.i.d.}}{\sim} N(0, 1)$. 

Theorem 2. Under the regression model (1) with $z_i \overset{\text{i.i.d.}}{\sim} N(0, 1)$, 

(10) $\inf_{\hat V} \ \sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} E\|\hat V - V\|_2^2 \ge C_1 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\}$

and for any fixed $x_* \in (0, 1)$ 

(11) $\inf_{\hat V} \ \sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} E(\hat V(x_*) - V(x_*))^2 \ge C_1 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\},$

where $C_1 > 0$ is a constant depending only on $\alpha$, $\beta$, $M_f$ and $M_V$. 

It follows immediately from Theorems 1 and 2 that the minimax rate of 
convergence for estimating V under both the global and local losses is 

$\max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\}.$

The proof of this theorem can be naturally divided into two parts. The 
first step is to show 

(12) $\inf_{\hat V} \ \sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} E(\hat V(x_*) - V(x_*))^2 \ge C_1 n^{-2\beta/(1+2\beta)}.$

This part is standard and relatively easy. Brown and Levine (2006) contains 
a detailed proof of this assertion for the case $\beta = 2$. Their argument can be 
easily generalized to other values of $\beta$. We omit the details. 
The proof of the second step, 

(13) $\inf_{\hat V} \ \sup_{f \in \Lambda^\alpha(M_f),\, V \in \Lambda^\beta(M_V)} E(\hat V(x_*) - V(x_*))^2 \ge C_1 n^{-4\alpha},$

is much more involved. The derivation of the lower bound (13) is based on a 
moment matching technique and a two-point testing argument. One of the 
main steps is to study a complicated hypothesis testing problem where the 
alternative hypothesis is a Gaussian location mixture with a special moment 
matching property. 

More specifically, let $X_1, \ldots, X_n \overset{\text{i.i.d.}}{\sim} P$ and consider the following hypoth- 
esis testing problem between 

$H_0 : P = P_0 = N(0, 1 + \theta_n^2)$

and 

$H_1 : P = P_1 = \int N(\theta_n \nu, 1)\, G(d\nu),$

where $\theta_n > 0$ is a constant and G is a distribution of the mean $\nu$ with 
compact support. The distribution G is chosen in such a way that, for some 
positive integer q depending on $\alpha$, the first q moments of G match exactly 
with the corresponding moments of the standard normal distribution. The 
existence of such a distribution is given in the following lemma from Karlin 
and Studden (1966). 

Lemma 1. For any fixed positive integer q, there exist a $B < \infty$ and a 
symmetric distribution G on [-B, B] such that G and the standard normal 
distribution have the same first q moments, that is, 

$\int_{-B}^{B} x^j\, G(dx) = \int_{-\infty}^{+\infty} x^j \varphi(x)\,dx, \quad j = 1, 2, \ldots, q,$

where $\varphi$ denotes the density of the standard normal distribution. 
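
One concrete way to obtain a distribution with the property in Lemma 1 is sketched below. It is an illustration under stated assumptions, not necessarily the construction of Karlin and Studden (1966): the nodes and normalized weights of an m-point Gauss-Hermite quadrature rule (probabilists' version) form a symmetric discrete distribution on a bounded interval, and since the rule is exact for polynomials of degree up to 2m - 1, this distribution shares its first q = 2m - 1 moments with N(0, 1). The function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def moment_matched_distribution(m):
    """Atoms and probabilities of a symmetric G matching moments 1..2m-1 of N(0,1)."""
    nodes, weights = hermegauss(m)       # quadrature for the weight exp(-x^2 / 2)
    probs = weights / weights.sum()      # normalize so the weights form a distribution
    return nodes, probs

def normal_moment(j):
    """j-th moment of N(0,1): zero for odd j, (j-1)!! for even j."""
    if j % 2 == 1:
        return 0.0
    return float(np.prod(np.arange(j - 1, 0, -2)))

nodes, probs = moment_matched_distribution(3)   # three atoms, so q = 5
support_bound = np.max(np.abs(nodes))           # plays the role of B in Lemma 1
g_moments = [float(np.sum(probs * nodes**j)) for j in range(7)]
normal_moments = [normal_moment(j) for j in range(7)]
```

With three atoms the first five moments agree, while the sixth moment of G differs from that of the standard normal; this first unmatched moment is what drives the gap between the two hypotheses.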

The moment matching property makes the testing between the two hy- 
potheses "difficult." The lower bound (13) then follows from a two-point 
argument with an appropriately chosen $\theta_n$. Technical details of the proof 
are given in Section 6. 
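
The effect of moment matching on the difficulty of the test can be checked numerically. The sketch below is illustrative only: it compares the single-observation densities $d_0$ (that of $N(0, 1+\theta^2)$) and $d_1$ (a location mixture $\int \varphi(t - v\theta)\,G(dv)$) for two assumed choices of G, a three-point Gauss-Hermite distribution matching five moments and a symmetric two-point distribution on {-1, 1} matching three. The $L_1$ distance is approximated by a Riemann sum on a grid; all names and grid settings are assumptions.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def normal_pdf(t, variance=1.0):
    """Density of N(0, variance) evaluated at t."""
    return np.exp(-t**2 / (2.0 * variance)) / np.sqrt(2.0 * np.pi * variance)

def l1_gap(theta, nodes, probs):
    """Riemann-sum approximation of the integral of |d_0(t) - d_1(t)|."""
    t = np.linspace(-12.0, 12.0, 48001)
    d0 = normal_pdf(t, 1.0 + theta**2)                                   # under H_0
    d1 = sum(p * normal_pdf(t - theta * v) for v, p in zip(nodes, probs))  # under H_1
    return np.sum(np.abs(d0 - d1)) * (t[1] - t[0])

nodes, weights = hermegauss(3)                 # matches moments 1..5 of N(0,1)
probs = weights / weights.sum()
two_point = (np.array([-1.0, 1.0]), np.array([0.5, 0.5]))  # matches moments 1..3 only

gap_matched = [l1_gap(theta, nodes, probs) for theta in (0.2, 0.1)]
gap_two_point = [l1_gap(theta, *two_point) for theta in (0.2, 0.1)]
ratio_matched = gap_matched[0] / gap_matched[1]        # roughly 2^6 if the gap is O(theta^6)
ratio_two_point = gap_two_point[0] / gap_two_point[1]  # roughly 2^4 if the gap is O(theta^4)
```

Halving $\theta$ shrinks the gap by roughly $2^{q+1}$, where q is the number of matched moments, which is the mechanism that lets the number of matched moments be tuned to $\alpha$ in the proof.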

Remark 4. For $\alpha$ between 1/8 and 1/4, a much simpler proof can be 
given with a two-point mixture for $P_1$ which matches the mean and variance, 
but not the higher moments, of $P_0$ and $P_1$. However, this simpler proof fails 
for smaller $\alpha$. It appears to be necessary in general to match higher moments 
of $P_0$ and $P_1$. 

Remark 5. Hall and Carroll (1989) gave the lower bound 
$C \max\{n^{-4\alpha/(1+2\alpha)}, n^{-2\beta/(1+2\beta)}\}$ for the minimax risk. This bound is larger 
than the lower bound given in our Theorem 2 and is incorrect. This is due 
to a miscalculation in Appendix C of their paper. A key step in that proof 
is to find some $d > 0$ such that 

$D = E\{[1 + \exp(\tfrac{1}{2}d + d^{1/2} N_1)]^{-1} (\tfrac{1}{2}d + d^{1/2} N_1)\} \ne 0.$

In the above expression, $N_1$ denotes a standard normal random variable. 
But in fact 

$D = \int_{-\infty}^{\infty} \frac{(1/2)d + d^{1/2}x}{1 + \exp((1/2)d + d^{1/2}x)} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx$

$\quad = \int_{-\infty}^{\infty} \frac{x}{1 + \exp(x)} \frac{1}{\sqrt{2\pi d}} \exp\left(-\frac{(x - (1/2)d)^2}{2d}\right) dx$

$\quad = \int_{-\infty}^{\infty} \frac{x}{\exp(x/2) + \exp(-x/2)} \frac{1}{\sqrt{2\pi d}} \exp\left(-\frac{x^2}{2d} - \frac{d}{8}\right) dx.$

This is an integral of an odd function which is identically 0 for all d. 
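
The identity D = 0 in Remark 5 can also be confirmed numerically. The following sketch approximates the first integral expression for D by a Riemann sum over a wide, fine grid; the grid limits, the grid size and the function name are assumptions made for illustration.

```python
import numpy as np

def hall_carroll_D(d):
    """Riemann-sum approximation of D = E{ y / (1 + exp(y)) }, y = d/2 + sqrt(d) * N_1."""
    x = np.linspace(-12.0, 12.0, 240001)      # grid for the standard normal variable
    y = 0.5 * d + np.sqrt(d) * x
    integrand = y / (1.0 + np.exp(y)) * np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)
    return np.sum(integrand) * (x[1] - x[0])

# D is evaluated for a few values of d; each should vanish up to rounding error.
d_values = (0.25, 1.0, 4.0)
D_values = [hall_carroll_D(d) for d in d_values]
```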

4. Discussion. Variance function estimation in regression is more typi- 
cally based on the residuals from a preliminary estimator $\hat f$ of the mean 
function. Such estimators have the form 

(14) $\hat V(x) = \sum_i W_i(x) (y_i - \hat f(x_i))^2$

where $W_i(x)$ are weight functions. A natural and common approach is to 
subtract in (14) an optimal estimator $\hat f$ of the mean function f(x). See, 
for example, Hall and Carroll (1989), Neumann (1994), Ruppert, Wand, 
Holst and Hössjer (1997), and Fan and Yao (1998). When the unknown 
mean function is smooth, this approach often works well since the bias in 
$\hat f$ is negligible and V can be estimated as well as when f is identically 
zero. However, when the mean function is not smooth, using the residuals 
from an optimally smoothed $\hat f$ will lead to a sub-optimal estimator of V. 
For example, Hall and Carroll (1989) used a kernel estimator with optimal 
bandwidth for $\hat f$ and showed that the resulting variance estimator attains 
the rate of 

(15) $\max\{n^{-4\alpha/(2\alpha+1)}, n^{-2\beta/(2\beta+1)}\}$

over $f \in \Lambda^\alpha(M_f)$ and $V \in \Lambda^\beta(M_V)$. This rate is strictly slower than the 
minimax rate when $\frac{4\alpha}{2\alpha+1} < \frac{2\beta}{2\beta+1}$ or equivalently, $\alpha < \frac{\beta}{2\beta+2}$. 
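
The equivalence of the two conditions is a short calculation, written out here for convenience; the example with $\beta = 2$ is added for illustration.

```latex
\begin{align*}
\frac{4\alpha}{2\alpha+1} < \frac{2\beta}{2\beta+1}
  &\iff 4\alpha(2\beta+1) < 2\beta(2\alpha+1) \\
  &\iff 4\alpha\beta + 4\alpha < 2\beta \\
  &\iff \alpha < \frac{\beta}{2\beta+2}.
\end{align*}
% Example: for beta = 2 the residual-based rate (15) is slower than the minimax rate
% whenever alpha < 1/3, whereas the minimax rate (5) is already free of alpha once
% 4*alpha >= 2*beta/(1+2*beta), that is, once alpha >= 1/5.
```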

Consider the example where V belongs to a regular parametric family, 
such as $\{V(x) = \exp(ax + b) : a, b \in \mathbb{R}\}$. As Hall and Carroll have noted, this 
case is equivalent to the case of $\beta = \infty$ in results like Theorems 1 and 2. 
Then the rate of convergence for this estimator becomes nonparametric at 
$n^{-4\alpha/(2\alpha+1)}$ for $\alpha < 1/2$, while the optimal rate is the usual parametric rate 
$n^{-1/2}$ for all $\alpha \ge 1/4$ and is $n^{-4\alpha}$ for $0 < \alpha < 1/4$. 

The main reason for the poor performance of such an estimator in the 
non-smooth setting is the "large" bias in $\hat f$. An optimal estimator $\hat f$ of f 
balances the squared bias and variance. However, the bias and variance of 
$\hat f$ have significantly different effects on the estimation of V. The bias of 
$\hat f$ cannot be further reduced in the second stage smoothing of the squared 
residuals, while the variance of $\hat f$ can be incorporated easily. For $f \in \Lambda^\alpha(M_f)$, 
the maximum bias of an optimal estimator $\hat f$ is of order $n^{-\alpha/(2\alpha+1)}$ which 
becomes the dominant factor in the risk of $\hat V$ when $\alpha < \frac{\beta}{2\beta+2}$. 

To minimize the effect of the mean function in such a setting one needs 
to use an estimator $\hat f(x_i)$ with minimal bias. Note that our approach is, 
in effect, using a very crude estimator $\hat f$ of f with $\hat f(x_i) = y_{i+1}$. Such an 
estimator has high variance and low bias. As we have seen in Section 2, the 
large variance of $\hat f$ does not pose a problem (in terms of rates) for estimating 
V. Hence for estimating the variance function V an optimal $\hat f$ is the one with 
minimum possible bias, not the one with minimum mean squared error. [Here 
we should of course exclude the obvious, and not useful, unbiased estimator 
$\hat f(x_i) = y_i$.] 
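
The comparison can be summarized by a heuristic accounting of how the bias of the mean estimator propagates into the risk of the variance estimator. Constants and cross terms are ignored; the orders are those stated in the text and in Section 2.

```latex
\begin{align*}
\text{optimal } \hat f:\quad
  & |E\hat f(x_i) - f(x_i)| \asymp n^{-\alpha/(2\alpha+1)}
  \;\Rightarrow\; \text{bias}^2 \asymp n^{-2\alpha/(2\alpha+1)}
  \;\Rightarrow\; \text{risk term of } \hat V \asymp n^{-4\alpha/(2\alpha+1)}, \\
\hat f(x_i) = y_{i+1}:\quad
  & |E\hat f(x_i) - f(x_i)| = |\delta_i| \le M_f n^{-\alpha}
  \;\Rightarrow\; \delta_i^2 \le M_f^2 n^{-2\alpha}
  \;\Rightarrow\; \text{risk term of } \hat V \lesssim n^{-4\alpha}.
\end{align*}
% The first line reproduces the alpha-dependent term of the rate (15),
% the second line the alpha-dependent term of the minimax rate (5).
```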

Another implication of our results is that the unknown mean function 
does not have any first-order effect for estimating V as long as f has more 
than 1/4 derivatives. When $\alpha > 1/4$, the variance estimator $\hat V$ is essentially 
adaptive over $f \in \Lambda^\alpha(M_f)$ for all $\alpha > 1/4$. In other words, if f is known to 
have more than 1/4 derivatives, the variance function V can be estimated 
with the same degree of first-order precision as if f is completely known. 
However, when $\alpha < 1/4$, the rate of convergence for estimating V is entirely 
determined by the degree of smoothness of the mean function f. 

5. Numerical results. We now consider in this section the finite sam- 
ple performance of our difference-based method for estimating the variance 
function. In particular we are interested in comparing the numerical perfor- 
mance of the difference-based estimator with the residual-based estimator 
of Fan and Yao (1998). The numerical results show that the performance 
of the difference-based estimator is somewhat inferior when the unknown 
mean function is very smooth. On the other hand, the difference-based esti- 
mator performs significantly better than the residual-based estimator when 
the mean function is not smooth. 

Consider the model (1) where the variance function is $V(x) = (x - \tfrac{1}{2})^2 + \tfrac{1}{2}$ 
while there are four possible mean functions: 

(i) $f_1(x) = 0$, 

(ii) $f_2(x) = \tfrac{3}{4} \sin(10\pi x)$, 

(iii) $f_3(x) = \tfrac{3}{4} \sin(20\pi x)$, 

(iv) $f_4(x) = \tfrac{3}{4} \sin(40\pi x)$. 

The mean functions are arranged from a constant to much rougher sinusoid 
functions; the "roughness" (the difficulty a particular mean function creates 
in estimation of the variance function V) is measured by the functional 
$R(f') = \int [f'(x)]^2\,dx$ since the mean-related term in the asymptotic bias of 
the variance estimator $\hat V(x)$ is directly proportional to it. The numerical 
performance of the difference-based method had been investigated earlier in 
Levine (2006) for a slightly different set of mean functions. 

Table 1 
Performance under the changing curvature of the mean function 

                                          Median CDMSE 
Mean function          R(f')        Fan-Yao method    Our method 
f = 0                  0            0.00299           0.00376 
f = 3/4 sin(10πx)      278.15       0.07161           0.00344 
f = 3/4 sin(20πx)      1110.89      0.08435           0.00384 
f = 3/4 sin(40πx)      4441.88      0.08363           0.00348 
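
The role of $R(f')$ can be illustrated directly from the decomposition of $D_i^2$ in Section 2: the mean contributes $\delta_i^2/2$ to the expectation of $D_i^2/2$, and for a smooth f on the design $x_i = i/n$ the average of $\delta_i^2/2$ is approximately $R(f')/(2n^2)$. The sketch below checks this approximation for the sinusoidal mean functions. The closed form $(a k \pi)^2/2$ for $R(f')$ with $f(x) = a \sin(k\pi x)$ is an assumption of this sketch and may differ slightly from the values reported in Table 1; the function name is hypothetical.

```python
import numpy as np

def mean_induced_bias(f, n):
    """Average of delta_i^2 / 2 over i = 1, ..., n-1 on the design x_i = i/n."""
    x = np.arange(1, n + 1) / n
    delta = f(x[:-1]) - f(x[1:])      # delta_i = f(x_i) - f(x_{i+1})
    return np.mean(delta**2) / 2.0

n = 1000
frequencies = [10, 20, 40]
biases = [
    mean_induced_bias(lambda x, k=k: 0.75 * np.sin(k * np.pi * x), n)
    for k in frequencies
]
# Integral of f'(x)^2 over [0, 1] for f(x) = a * sin(k * pi * x) with integer k.
roughness = [(0.75 * k * np.pi) ** 2 / 2.0 for k in frequencies]
approximations = [r / (2.0 * n**2) for r in roughness]
zero_mean_bias = mean_induced_bias(lambda x: np.zeros_like(x), n)
```

For n = 1000 the mean-induced term stays small in absolute size even for the roughest mean function, which is consistent with the nearly constant CDMSE of the difference-based method across the rows of Table 1.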

For comparison purposes, the same four combinations of the mean and 
variance functions are investigated using the two-step method described in 
Fan and Yao (1998). We expect this method to perform better than the 
difference-based method in the case of a constant mean function, but to 
get progressively worse as the roughness of the mean function considered 
increases. The following table summarizes results of simulations using both 
methods. We consider the fixed equidistant design $x_i = i/n$ on [0, 1] where 
the sample size is n = 1000; 100 simulations are performed and the band- 
widths for estimating the mean and variance functions are selected using 
K-fold cross-validation with K = 10. The performance of both methods is 
measured using the cross-validation discrete mean squared error (CDMSE) 
that is defined as 

(16) $\mathrm{CDMSE} = n^{-1} \sum_{i=1}^{n} [\hat V_{h_{CV}}(x_i) - V(x_i)]^2$

with $h_{CV}$ being the K-fold cross-validation bandwidth. We report the me- 
dian CDMSE for variance function estimators based on 100 simulations. 
Table 1 provides the summary of the performance. 

It is easily seen from the table that the two-step method of Fan and Yao, 
based on estimating the variance using squared residuals from the mean 
function estimation, tends to perform slightly better when the mean function 
is very smooth but noticeably worse when it is rougher. Note that here we 
only use the first-order differences. The performance of the difference based 
estimator can be improved in the case of smooth mean function by using 
higher order differences. The Fan-Yao method performs about 26% better 
in the first case of the constant mean function. However, the risk (CDMSE) 
of the difference based method is over 95% smaller than the risk of the 
Fan-Yao method for the second mean function. In the rougher cases, the 
difference is approximately the same. The CDMSE of the difference based 
method is about 95% and 96% less than the corresponding risk of the residual 
based method for the third and fourth mean functions, respectively. 
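
The percentages quoted above follow from the entries of Table 1 by simple arithmetic, as the following short check shows. The dictionary keys and variable names are illustrative; the numbers are the median CDMSE values from the table.

```python
# Median CDMSE from Table 1: (Fan-Yao method, difference-based method).
table1 = {
    "f = 0": (0.00299, 0.00376),
    "f = 3/4 sin(10 pi x)": (0.07161, 0.00344),
    "f = 3/4 sin(20 pi x)": (0.08435, 0.00384),
    "f = 3/4 sin(40 pi x)": (0.08363, 0.00348),
}

# Ratio of the difference-based risk to the Fan-Yao risk for each mean function.
ratios = {name: ours / fan_yao for name, (fan_yao, ours) in table1.items()}

# Constant mean: excess risk of the difference-based method relative to Fan-Yao.
excess_constant_mean = ratios["f = 0"] - 1.0

# Rough means: relative risk reduction achieved by the difference-based method.
reductions = {name: 1.0 - r for name, r in ratios.items() if name != "f = 0"}
```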

6. Proofs. 

6.1. Upper bound: Proof of Theorem 1. We shall only prove (8). Inequal- 
ity (9) is a direct consequence of (8). Recall that 

$D_i^2 = \delta_i^2 + 2V_i + 2V_i(\epsilon_i^2 - 1) + 2\sqrt{2}\,\delta_i V_i^{1/2} \epsilon_i,$

where $\delta_i = f(x_i) - f(x_{i+1})$, $V_i^{1/2} = \sqrt{\tfrac{1}{2}(V(x_i) + V(x_{i+1}))}$ and 

$\epsilon_i = (V(x_i) + V(x_{i+1}))^{-1/2} (V^{1/2}(x_i) z_i - V^{1/2}(x_{i+1}) z_{i+1}).$

Without loss of generality, suppose $h = n^{-1/(1+2\beta)}$. It is easy to see that 
for any $x_* \in [0, 1]$, $\sum_i K_i^h(x_*) = 1$, and when $x_* > (x_i + x_{i+1})/2 + h$ or 
$x_* < (x_i + x_{i-1})/2 - h$, $K_i^h(x_*)$ equals 0. Suppose $k \le \bar{k}$, we also have 

$\left(\sum_i |K_i^h(x_*)|\right)^2 \le 2nh \sum_i (K_i^h(x_*))^2 \le 2 \int_{-1}^{1} K_*^2(u)\,du \le 2\bar{k},$

where $K_*(u) = K(u)$ when $x_* \in (h, 1-h)$; $K_*(u) = K_t(u)$ when $x_* = th$ for 
some $t \in [0, 1]$; and $K_*(u) = K_t(-u)$ when $x_* = 1 - th$ for some $t \in [0, 1]$. 

The second inequality above is obtained as follows. For the sake of sim- 
plicity, assume that $K_* = K$; the same argument can be repeated for bound- 
ary kernels as well. Using the definition of $K_i^h(x_*)$, we note that it can be 
rewritten as $\frac{1}{nh} \int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} K\left(\frac{x_* - u}{h}\right) d(nu)$. Since the last integral is taken 
with respect to the probability measure on the interval 
$[(x_i + x_{i-1})/2, (x_i + x_{i+1})/2]$, we can apply Jensen's inequality to obtain 

$(K_i^h(x_*))^2 \le \frac{1}{(nh)^2} \int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} K^2\left(\frac{x_* - u}{h}\right) d(nu) = \frac{1}{nh^2} \int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} K^2\left(\frac{x_* - u}{h}\right) du.$

Thus, 

$2nh \sum_i (K_i^h(x_*))^2 \le \frac{2}{h} \int_0^1 K^2\left(\frac{x_* - u}{h}\right) du \le 2 \int_{-1}^{1} K^2(u)\,du.$



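For all $f \in \Lambda^\alpha(M_f)$ and $V \in \Lambda^\beta(M_V)$, the mean squared error of $\hat V$ at $x_*$ 
satisfies 

$E(\hat V(x_*) - V(x_*))^2$

$\quad = E\left(\sum_{i=1}^{n-1} K_i^h(x_*) \tfrac{1}{2} D_i^2 - V(x_*)\right)^2$

$\quad = E\Big\{\sum_{i=1}^{n-1} K_i^h(x_*) \tfrac{1}{2}\delta_i^2 + \sum_{i=1}^{n-1} K_i^h(x_*)(V_i - V(x_*))$
$\qquad\quad + \sum_{i=1}^{n-1} K_i^h(x_*) V_i (\epsilon_i^2 - 1) + \sum_{i=1}^{n-1} K_i^h(x_*) \sqrt{2}\,\delta_i V_i^{1/2} \epsilon_i\Big\}^2$

$\quad \le 4\left(\sum_{i=1}^{n-1} K_i^h(x_*) \tfrac{1}{2}\delta_i^2\right)^2 + 4\left(\sum_{i=1}^{n-1} K_i^h(x_*)(V_i - V(x_*))\right)^2$
$\qquad\quad + 4E\left(\sum_{i=1}^{n-1} K_i^h(x_*) V_i (\epsilon_i^2 - 1)\right)^2 + 4E\left(\sum_{i=1}^{n-1} K_i^h(x_*) \sqrt{2}\,\delta_i V_i^{1/2} \epsilon_i\right)^2.$

Suppose $\alpha < 1/4$, otherwise $n^{-4\alpha} < n^{-2\beta/(1+2\beta)}$ for any $\beta$. Since for any i, 
$|\delta_i| = |f(x_i) - f(x_{i+1})| \le M_f |x_i - x_{i+1}|^\alpha = M_f n^{-\alpha}$, we have 

$4\left(\sum_{i=1}^{n-1} K_i^h(x_*) \tfrac{1}{2}\delta_i^2\right)^2 \le 4\left(\sum_{i=1}^{n-1} |K_i^h(x_*)| \tfrac{1}{2} M_f^2 n^{-2\alpha}\right)^2 \le 2\bar{k} M_f^4 n^{-4\alpha}.$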



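Note that for any $x, y \in [0, 1]$, Taylor's theorem yields 

$\left|V(x) - V(y) - \sum_{j=1}^{\lfloor\beta\rfloor} \frac{V^{(j)}(y)}{j!}(x - y)^j\right|$

$\quad = \left|\int_y^x \frac{(x - u)^{\lfloor\beta\rfloor - 1}}{(\lfloor\beta\rfloor - 1)!} (V^{(\lfloor\beta\rfloor)}(u) - V^{(\lfloor\beta\rfloor)}(y))\,du\right|$

$\quad \le \left|\int_y^x \frac{(x - u)^{\lfloor\beta\rfloor - 1}}{(\lfloor\beta\rfloor - 1)!} M_V |x - y|^{\beta'}\,du\right|$

$\quad \le \frac{M_V}{\lfloor\beta\rfloor!} |x - y|^{\beta},$

where $\beta' = \beta - \lfloor\beta\rfloor$. So, 

$V_i - V(x_*) \le \frac{1}{2} \sum_{j=1}^{\lfloor\beta\rfloor} \frac{V^{(j)}(x_*)}{j!} \left[\left(\frac{i}{n} - x_*\right)^j + \left(\frac{i+1}{n} - x_*\right)^j\right] + \frac{1}{2} M_V \left|\frac{i}{n} - x_*\right|^\beta + \frac{1}{2} M_V \left|\frac{i+1}{n} - x_*\right|^\beta$

and 

$V_i - V(x_*) \ge \frac{1}{2} \sum_{j=1}^{\lfloor\beta\rfloor} \frac{V^{(j)}(x_*)}{j!} \left[\left(\frac{i}{n} - x_*\right)^j + \left(\frac{i+1}{n} - x_*\right)^j\right] - \frac{1}{2} M_V \left|\frac{i}{n} - x_*\right|^\beta - \frac{1}{2} M_V \left|\frac{i+1}{n} - x_*\right|^\beta.$

Since the kernel functions have vanishing moments, for $j = 1, 2, \ldots, \lfloor\beta\rfloor$, 
when n is large enough 

$\left|\sum_{i=1}^{n-1} K_i^h(x_*) \left(\frac{i}{n} - x_*\right)^j\right|$

$\quad = \left|\sum_{i=1}^{n-1} \int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} \frac{1}{h} K\left(\frac{x_* - u}{h}\right) \left(\frac{i}{n} - x_*\right)^j du\right|$

$\quad = \left|\sum_{i=1}^{n-1} \int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} \frac{1}{h} K\left(\frac{x_* - u}{h}\right) \left[\left(\frac{i}{n} - x_*\right)^j - (u - x_*)^j\right] du\right|$

$\quad \le c \sum_{i=1}^{n-1} \int_{(x_i + x_{i-1})/2}^{(x_i + x_{i+1})/2} \frac{1}{h} \left|K\left(\frac{x_* - u}{h}\right)\right| \times \frac{1}{n}\,du = c' n^{-1}$

for some generic constants $c, c' > 0$. Similarly, $\left|\sum_{i=1}^{n-1} K_i^h(x_*) \left(\frac{i+1}{n} - x_*\right)^j\right| \le c' n^{-1}$. 
So, 

$\left|\sum_{j=1}^{\lfloor\beta\rfloor} \frac{V^{(j)}(x_*)}{j!} \sum_{i=1}^{n-1} K_i^h(x_*) \left[\left(\frac{i}{n} - x_*\right)^j + \left(\frac{i+1}{n} - x_*\right)^j\right]\right| \le C n^{-1}$

for some constant $C > 0$ which does not depend on $x_*$. Note that $V^{(\lfloor\beta\rfloor)}$ sat- 
isfies Hölder condition with exponent $0 < \beta' = \beta - \lfloor\beta\rfloor \le 1$ and is, therefore, 
continuous on [0, 1] and bounded. Then we have 

$4\left(\sum_{i=1}^{n-1} K_i^h(x_*)(V_i - V(x_*))\right)^2$

$\quad \le 2C^2 n^{-2} + 2M_V^2 \left(\sum_{i=\lfloor n(x_* - h)\rfloor}^{\lfloor n(x_* + h)\rfloor + 1} |K_i^h(x_*)| \left(\left|\frac{i}{n} - x_*\right|^\beta + \left|\frac{i+1}{n} - x_*\right|^\beta\right)\right)^2$

$\quad \le 2C^2 n^{-2} + 2M_V^2 \left(\sum_{i=\lfloor n(x_* - h)\rfloor}^{\lfloor n(x_* + h)\rfloor + 1} |K_i^h(x_*)| \cdot 2(3h)^\beta\right)^2$

$\quad \le 2C^2 n^{-2} + 8 \times 3^{2\beta} M_V^2 n^{-2\beta/(1+2\beta)} \times (2\bar{k}).$

The last inequality is due to the fact that, for every i with $K_i^h(x_*) \ne 0$ and 
n sufficiently large, both $|i/n - x_*|$ and $|(i+1)/n - x_*|$ are at most 
$h + 2/n \le 3h$. On the other 
hand, notice that $\epsilon_1, \epsilon_3, \epsilon_5, \ldots$ are independent and $\epsilon_2, \epsilon_4, \ldots$ are indepen- 
dent, we have 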





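$4E\left(\sum_{i=1}^{n-1} K_i^h(x_*) \sqrt{2}\,\delta_i V_i^{1/2} \epsilon_i\right)^2 = 4\,\mathrm{Var}\left(\sum_{i=1}^{n-1} K_i^h(x_*) \sqrt{2}\,\delta_i V_i^{1/2} \epsilon_i\right)$

$\quad \le 16 \sum_{i=\lfloor n(x_* - h)\rfloor}^{\lfloor n(x_* + h)\rfloor + 1} (K_i^h(x_*))^2 \delta_i^2 V_i$

$\quad \le 16 M_f^2 M_V n^{-2\alpha - 2\beta/(1+2\beta)} \times \bar{k}$

and 

$4E\left(\sum_{i=1}^{n-1} K_i^h(x_*) V_i (\epsilon_i^2 - 1)\right)^2 = 4\,\mathrm{Var}\left(\sum_{i=1}^{n-1} K_i^h(x_*) V_i (\epsilon_i^2 - 1)\right)$

$\quad \le 8 M_V^2 \mu_4 \sum_{i=1}^{n-1} (K_i^h(x_*))^2$

$\quad \le 8 M_V^2 \mu_4 \frac{1}{nh} \bar{k}$

$\quad = 8 M_V^2 \mu_4 n^{-2\beta/(1+2\beta)} \times \bar{k}$

where $\mu_4$ denotes the uniform bound for the fourth moments of the $\epsilon_i$. 

Putting the four terms together we have, uniformly for all $x_* \in [0, 1]$, 
$f \in \Lambda^\alpha(M_f)$ and $V \in \Lambda^\beta(M_V)$, 

$E(\hat V(x_*) - V(x_*))^2$

$\quad \le 2\bar{k} M_f^4 n^{-4\alpha} + 2C^2 n^{-2} + 8 \times 3^{2\beta} M_V^2 n^{-2\beta/(1+2\beta)} \times (2\bar{k})$
$\qquad + 8 M_V^2 \mu_4 n^{-2\beta/(1+2\beta)} \bar{k} + 16 M_f^2 M_V n^{-2\alpha - 2\beta/(1+2\beta)} \bar{k}$

$\quad \le C_0 \cdot \max\{n^{-4\alpha}, n^{-2\beta/(1+2\beta)}\}$

for some constant $C_0 > 0$. This proves (8). 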

6.2. Lower bound: Proof of Theorem 2. We shall only prove the lower 
bound for the pointwise squared error loss. The same proof with minor mod- 
ifications immediately yields the lower bound under the integrated squared 
error. Note that, to prove inequality (13), we only need to focus on the 
case where $\alpha < 1/4$, otherwise $n^{-2\beta/(1+2\beta)}$ is always greater than $n^{-4\alpha}$ for 
sufficiently large n and then (13) follows directly from (12). 

For a given $0 < \alpha < 1/4$, there exists an integer q such that $(q + 1)\alpha > 1$. 
For convenience we take q to be an odd integer. From Lemma 1, there is a 
positive constant $B < \infty$ and a symmetric distribution G on [-B, B] such 
that G and N(0, 1) have the same first q moments. Let $r_i$, $i = 1, \ldots, n$, 
be independent variables with the distribution G. Set $\theta_n = \frac{M_f}{2B} n^{-\alpha}$, $f_0 \equiv 0$, 
$V_0(x) \equiv 1 + \theta_n^2$ and $V_1(x) \equiv 1$. Let $g(x) = 1 - 2n|x|$ for $x \in [-\frac{1}{2n}, \frac{1}{2n}]$ and 
0 otherwise. Define the random function $f_1$ by 

$f_1(x) = \sum_{i=1}^{n} \theta_n r_i\, g(x - x_i)\, I(0 \le x \le 1).$

Then it is easy to see that $f_1$ is in $\Lambda^\alpha(M_f)$ for all realizations of $r_i$. Moreover, 
$f_1(x_i) = \theta_n r_i$ are independent and identically distributed. 
Now consider testing the following hypotheses: 

$H_0 : y_i = f_0(x_i) + V_0^{1/2}(x_i)\varepsilon_i, \quad i = 1, \ldots, n,$

$H_1 : y_i = f_1(x_i) + V_1^{1/2}(x_i)\varepsilon_i, \quad i = 1, \ldots, n,$

where $\varepsilon_i$ are independent N(0, 1) variables which are also independent of 
the $r_i$'s. Denote by $P_0$ and $P_1$ the joint distributions of $y_i$'s under $H_0$ and 
$H_1$, respectively. Note that for any estimator $\hat V$ of V, 

(17) $\max\{E(\hat V(x_*) - V_0(x_*))^2, E(\hat V(x_*) - V_1(x_*))^2\}$
$\qquad \ge \frac{1}{16} \rho^4(P_0, P_1)(V_0(x_*) - V_1(x_*))^2 = \frac{1}{16} \rho^4(P_0, P_1) \frac{M_f^4}{16 B^4} n^{-4\alpha}$

where $\rho(P_0, P_1)$ is the Hellinger affinity between $P_0$ and $P_1$. See, for example, 
Le Cam (1986). Let $p_0$ and $p_1$ be the probability density function of $P_0$ and 
$P_1$ with respect to the Lebesgue measure $\mu$, then $\rho(P_0, P_1) = \int \sqrt{p_0 p_1}\,d\mu$. 
The minimax lower bound (13) follows immediately from the two-point 
bound (17) if we show that for any n, the Hellinger affinity $\rho(P_0, P_1) \ge C$ 
for some constant $C > 0$. (C may depend on q, but does not depend on n.) 

WANG, BROWN, CAI AND LEVINE 

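Note that under $H_0$, $y_i \sim N(0, 1 + \theta_n^2)$ and its density $d_0$ can be written 
as 

$d_0(t) = \frac{1}{\sqrt{1 + \theta_n^2}}\, \varphi\left(\frac{t}{\sqrt{1 + \theta_n^2}}\right) = \int \varphi(t - v\theta_n)\varphi(v)\,dv.$

Under $H_1$, the density of $y_i$ is $d_1(t) = \int \varphi(t - v\theta_n)\,G(dv)$. 

It is easy to see that $\rho(P_0, P_1) = (\int \sqrt{d_0 d_1}\,d\mu)^n$, since the $y_i$'s are inde- 
pendent variables. Note that the Hellinger affinity is bounded below by the 
total variation affinity 

$\int \sqrt{d_0(t) d_1(t)}\,dt \ge 1 - \frac{1}{2} \int |d_0(t) - d_1(t)|\,dt.$

Taylor's expansion yields 

$\varphi(t - v\theta_n) = \varphi(t) \sum_{k=0}^{\infty} v^k \theta_n^k \frac{H_k(t)}{k!}$

where $H_k(t)$ is the corresponding Hermite polynomial. And from the con- 
struction of the distribution G, 

$\int v^i\,G(dv) = \int v^i \varphi(v)\,dv$ for $i = 0, 1, \ldots, q$. 

So, 

(18) $|d_0(t) - d_1(t)|$

$\quad = \left|\int \varphi(t - v\theta_n)\,G(dv) - \int \varphi(t - v\theta_n)\varphi(v)\,dv\right|$

$\quad = \left|\int \varphi(t) \sum_{i=0}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i\,G(dv) - \int \varphi(t) \sum_{i=0}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i \varphi(v)\,dv\right|$

$\quad = \left|\int \varphi(t) \sum_{i=q+1}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i\,G(dv) - \int \varphi(t) \sum_{i=q+1}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i \varphi(v)\,dv\right|$

$\quad \le \left|\int \varphi(t) \sum_{i=q+1}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i\,G(dv)\right| + \left|\int \varphi(t) \sum_{i=q+1}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i \varphi(v)\,dv\right|.$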



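Suppose $q + 1 = 2p$ for some integer p, it can be seen that 

$\left|\int \varphi(t) \sum_{i=q+1}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i\,G(dv)\right| = \left|\varphi(t) \sum_{i=p}^{\infty} \frac{H_{2i}(t)}{(2i)!} \theta_n^{2i} \int v^{2i}\,G(dv)\right|$

$\quad \le \varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{(2i)!}\right| \theta_n^{2i} B^{2i}$

and 

$\left|\int \varphi(t) \sum_{i=q+1}^{\infty} \frac{H_i(t)}{i!} v^i \theta_n^i \varphi(v)\,dv\right| = \left|\varphi(t) \sum_{i=p}^{\infty} \frac{H_{2i}(t)}{(2i)!} \theta_n^{2i} \int v^{2i} \varphi(v)\,dv\right|$

$\quad \le \varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{2^i \cdot i!}\right| \theta_n^{2i}.$

So from (18), 

$|d_0(t) - d_1(t)| \le \varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{(2i)!}\right| \theta_n^{2i} B^{2i} + \varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{2^i \cdot i!}\right| \theta_n^{2i}$

and then 

(19) $\int \sqrt{d_0(t) d_1(t)}\,dt \ge 1 - \frac{1}{2} \int \left(\varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{(2i)!}\right| \theta_n^{2i} B^{2i} + \varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{2^i \cdot i!}\right| \theta_n^{2i}\right) dt.$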

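Since $\int t^{2i} \varphi(t)\,dt = (2i - 1)!!$ where $(2i - 1)!! = (2i - 1) \times (2i - 3) \times \cdots \times 3 \times 1$, 
for the Hermite polynomial $H_{2i}$ we have 

$\int \varphi(t) |H_{2i}(t)|\,dt$

$\quad = \int \varphi(t) \left|(2i - 1)!! \times \left[1 + \sum_{k=1}^{i} \frac{(-2)^k i(i-1) \cdots (i-k+1)}{(2k)!} t^{2k}\right]\right| dt$

$\quad \le \int \varphi(t) (2i - 1)!! \times \left[1 + \sum_{k=1}^{i} \frac{2^k i(i-1) \cdots (i-k+1)}{(2k)!} t^{2k}\right] dt$

$\quad = (2i - 1)!! \times \left(1 + \sum_{k=1}^{i} \frac{2^k i(i-1) \cdots (i-k+1)}{(2k)!} \int t^{2k} \varphi(t)\,dt\right)$

$\quad = (2i - 1)!! \times \left(1 + \sum_{k=1}^{i} \frac{2^k i(i-1) \cdots (i-k+1)}{(2k)!} (2k - 1)!!\right)$

$\quad = (2i - 1)!! \times \left(1 + \sum_{k=1}^{i} \frac{i(i-1) \cdots (i-k+1)}{k!}\right)$

$\quad = 2^i \times (2i - 1)!!.$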

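For sufficiently large n, $\theta_n < 1/2$ and it then follows from the above inequal- 
ity that 

$\int \varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{(2i)!}\right| \theta_n^{2i} B^{2i}\,dt \le \sum_{i=p}^{\infty} \frac{\theta_n^{2i} B^{2i}}{(2i)!} \int \varphi(t) |H_{2i}(t)|\,dt$

$\quad \le \sum_{i=p}^{\infty} \frac{\theta_n^{2i} B^{2i}}{(2i)!} 2^i \times (2i - 1)!! = \sum_{i=p}^{\infty} \frac{\theta_n^{2i} B^{2i}}{i!} \le \theta_n^{2p} e^{B^2}$

and 

$\int \varphi(t) \sum_{i=p}^{\infty} \left|\frac{H_{2i}(t)}{2^i \cdot i!}\right| \theta_n^{2i}\,dt \le \sum_{i=p}^{\infty} \frac{\theta_n^{2i}}{2^i \cdot i!} \int \varphi(t) |H_{2i}(t)|\,dt$

$\quad \le \sum_{i=p}^{\infty} \frac{\theta_n^{2i}}{2^i \cdot i!} 2^i \times (2i - 1)!! = \theta_n^{2p} \sum_{i=p}^{\infty} \frac{(2i - 1)!!}{i!} \theta_n^{2i - 2p}$

$\quad \le \theta_n^{2p} \sum_{i=p}^{\infty} 2^i \times \theta_n^{2i - 2p} \le \theta_n^{2p} \sum_{i=p}^{\infty} 2^i \times 2^{-2i + 2p}$

$\quad = \theta_n^{2p} \times 2^{p+1}.$

Then from (19) 

$\int \sqrt{d_0(t) d_1(t)}\,dt \ge 1 - \theta_n^{2p} \left(\tfrac{1}{2} e^{B^2} + 2^{p}\right) \triangleq 1 - c\,\theta_n^{q+1},$

where c is a constant that only depends on q. So 

$\rho(P_0, P_1) = \left(\int \sqrt{d_0(t) d_1(t)}\,dt\right)^n \ge (1 - c\,\theta_n^{q+1})^n = (1 - c\,n^{-\alpha(q+1)})^n.$

Since $\alpha(q + 1) > 1$, $\lim_{n \to \infty} (1 - c\,n^{-\alpha(q+1)})^n \ge e^{-c} > 0$ and the theorem then 
follows from (17). 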